Data-driven hypothesis weighting increases detection power in multiple testing

Author

  • NIKOLAOS IGNATIADIS
Abstract

Hypothesis weighting is a powerful approach for improving the power of data analyses that employ multiple testing. However, in general it is not evident how to choose the weights in a data-dependent manner. We describe independent hypothesis weighting (IHW), a method for making use of informative covariates that are independent of the test statistic under the null, but informative of each test's power or prior probability of the null hypothesis. Covariates can be continuous or categorical and need not fulfill any particular assumptions. The method increases statistical power in applications while controlling the false discovery rate (FDR) and produces additional insight by revealing the covariate-weight relationship. Independent hypothesis weighting is a practical approach to discovery of associations in large datasets.

INTRODUCTION

Multiple testing is an important part of many high-throughput data analysis workflows. A common objective is control of the FDR, i.e., the expected fraction of false positives among all positives. Algorithms exist that achieve this objective by working solely off the list of p-values from the hypothesis tests [1–5]. However, such an approach tends to be suboptimal when the individual tests differ in their statistical properties, such as sample size, true effect size, signal-to-noise ratio, or prior probability of being false. For example, in RNA-seq differential gene expression analysis, each hypothesis is associated with a different gene, and because the number of reads mapped per gene differs, the hypotheses may differ greatly in their signal-to-noise ratio. In genome-wide association studies (GWAS), associations are sought between genetic polymorphisms and phenotypic traits; however, the power to detect an association is lower for rarer polymorphisms (all else being equal). In GWAS of gene expression phenotypes (eQTL), cis-effects are a priori more likely than associations between a gene product and a distant polymorphism.
To take into account the different statistical properties of the tests, one can associate each test with a weight, a non-negative number that measures its priority. The weights fulfill a budget criterion, commonly that they average to one. Hypotheses with higher weights get prioritized [6]. The procedure of Benjamini and Hochberg (BH) [1] can be modified to allow weighting simply by replacing the original p-values p_i with their weighted versions p_i/w_i (where w_i is the weight of hypothesis i) [7]. However, FDR control of this approach is guaranteed only if the weights are pre-specified and thus independent of the data. In practice, the optimal choice of weights is rarely known, and a data-driven method would be desirable [8–12]. However, a method that is generally applicable, shows good performance and ensures type-I error control has been lacking.

RESULTS

Independent hypothesis weighting (IHW) is a multiple testing procedure that applies the weighted BH method [7] using weights derived from the data. The input to IHW is a two-column table of p-values and covariates. The covariate can be any continuous-valued or categorical variable that is thought to be informative on the statistical properties of the hypothesis tests, while it is independent of the p-value under...

bioRxiv preprint first posted online Dec. 13, 2015; doi: http://dx.doi.org/10.1101/034330. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
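The weighted BH step described in the introduction (replace each p-value p_i by p_i/w_i, then run the ordinary BH step-up rule) can be sketched in a few lines. This is only an illustration under stated assumptions: the weights are supplied by the caller and treated as fixed in advance, the function name `weighted_bh` and the example numbers are hypothetical, and the sketch does not show how IHW itself derives weights from the covariate.

```python
import numpy as np


def weighted_bh(pvalues, weights, alpha=0.1):
    """Weighted Benjamini-Hochberg step-up rule (illustrative sketch).

    Assumes the weights are non-negative, average to one, and were
    chosen independently of the p-values. Returns a boolean array
    marking the rejected hypotheses.
    """
    p = np.asarray(pvalues, dtype=float)
    w = np.asarray(weights, dtype=float)
    if p.shape != w.shape:
        raise ValueError("pvalues and weights must have the same shape")
    if np.any(w < 0):
        raise ValueError("weights must be non-negative")
    if not np.isclose(w.mean(), 1.0):
        raise ValueError("weights must average to one (budget criterion)")

    # Weighted p-values p_i / w_i; a zero weight means the hypothesis
    # can never be rejected, so its weighted p-value is set to infinity.
    q = np.full_like(p, np.inf)
    positive = w > 0
    q[positive] = p[positive] / w[positive]

    # Ordinary BH step-up applied to the weighted p-values.
    m = q.size
    order = np.argsort(q)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = q[order] <= thresholds

    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject


# Hypothetical example: four tests, the first two assumed to be
# better powered and therefore given the whole weight budget.
pvals = [0.03, 0.04, 0.6, 0.9]
uniform = weighted_bh(pvals, [1, 1, 1, 1], alpha=0.05)
prioritized = weighted_bh(pvals, [2, 2, 0, 0], alpha=0.05)
print(uniform.sum(), prioritized.sum())
```

With uniform weights the function reduces to the unweighted BH procedure. In the toy example, shifting the weight budget toward the first two hypotheses lets both be rejected at the same nominal FDR level, whereas uniform weights reject none.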

Similar resources

Data-driven hypothesis weighting increases detection power in big data analytics

Hypothesis weighting is a powerful approach for improving the power of data analyses that employ multiple testing. However, in general it is not evident how to choose the weights. We describe IHW, a method for data-driven hypothesis weighting that makes use of informative covariates that are independent of the test statistic under the null, but informative of each test’s power or prior probabil...

A New Method for Root Detection in Minirhizotron Images: Hypothesis Testing Based on Entropy-Based Geometric Level Set Decision

In this paper, a new method is introduced for root detection in minirhizotron images for root investigation. In this method, a hypothesis testing framework is first defined to separate roots from background and noise. The correct roots are then extracted using an entropy-based geometric level set decision function. Performance of the proposed method is evaluated on real captured images in tw...

Weighted Hypothesis Testing

The power of multiple testing procedures can be increased by using weighted p-values (Genovese, Roeder and Wasserman 2005). We derive the optimal weights and we show that the power is remarkably robust to misspecification of these weights. We consider two methods for choosing weights in practice. The first, external weighting, is based on prior information. The second, estimated weighting, uses...

Ensemble of Data-Driven Prognostic Algorithms with Weight Optimization and K-Fold Cross Validation

The traditional data-driven prognostic approach is to construct multiple candidate algorithms using a training data set, evaluate their respective performance using a testing data set, and select the one with the best performance while discarding all the others. This approach has three shortcomings: (i) the selected standalone algorithm may not be robust, i.e., it may be less accurate when the ...

Analysis and Diagnosis of Partial Discharge of Power Capacitors Using Extension Neural Network Algorithm and Synchronous Detection Based Chaos Theory

Power capacitors are important equipment in power systems, operated at high voltage levels and high temperatures for long periods. Over time, their insulation fracture rate increases, and partial discharge is the most important cause of their fracture. Therefore, fast and accurate methods are of great importance for diagnosing partial discharge. Conventional me...

Publication date: 2016